Wohlig Bet Prediction

Index

1. Data Load
    1.1 Merge Data

2. EDA (Exploratory Data Analysis)
    2.1 Structure of the Data
    2.2 Descriptive Stats
    2.3 Data Imbalance Check
    2.4 Check for missing values
    2.5 Feature Analysis
        2.5.1 Categorical Feature Analysis
        2.5.2 Numerical Features Analysis
    2.6 Correlation Analysis

3. Data Standardization
    3.1 Dropping useless features based on EDA and Correlation
    3.2 Missing Value Imputation
    3.3 Train-Test Split
    3.4 Feature Encoding

4. Feature Selection
    4.1 For Numerical Data
    4.2 For Categorical Data


5. Build Data Matrix for Models


6. TSNE Visualization


7. Models
    7.1 Machine Learning Models
        7.1.1 Random Model
        7.1.2 Logistic Regression (SGD) with Hyperparameter Tuning
        7.1.3 Logistic Regression (Sklearn)
        7.1.4 KNN
        7.1.5 Decision Tree Model
        7.1.6 Random Forest Model
        7.1.7 XGBoost Model
    7.2 Deep Learning Model
        7.2.1 Neural Network (ANN) Model


8. Comparison


9. Conclusion

Machine Learning Objective

1. Build a high-accuracy model for binary classification.

Importing Libraries / Modules

1. Load Data

1.1 Merge Data

Save dataframe to csv file.
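
A minimal sketch of the merge-and-save step. The two input tables (`bets`, `markets`), their values, and the `marketId` join key are assumptions made for illustration; the actual source files are not shown in this notebook.

```python
import os
import tempfile

import pandas as pd

# Hypothetical inputs: the real source tables are not shown here.
bets = pd.DataFrame({
    "_id": ["b1", "b2", "b3"],
    "marketId": ["m1", "m2", "m1"],
    "stake": [100.0, 250.0, 50.0],
})
markets = pd.DataFrame({
    "marketId": ["m1", "m2"],
    "marketName": ["Match Odds", "Winner"],
})

# A left join keeps every bet even if its market record is missing.
merged = bets.merge(markets, on="marketId", how="left")

# index=False avoids writing the row index as an extra unnamed column.
csv_path = os.path.join(tempfile.mkdtemp(), "merged.csv")
merged.to_csv(csv_path, index=False)

# Reload to confirm the file round-trips.
reloaded = pd.read_csv(csv_path)
print(reloaded)
```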

2. EDA (Exploratory Data Analysis)

2.1 Structure of the data

Observation :

2.2 Descriptive Stats

Observation :

2.3 Data Imbalance Check
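
One way to check class balance is to compare the class counts and their shares. The sketch below assumes a hypothetical target column named `status`; `INVALID_BET` is a label mentioned later in this notebook, while `VALID_BET` and the counts are made up.

```python
import pandas as pd

# Hypothetical target column and label counts, for illustration only.
df = pd.DataFrame({"status": ["VALID_BET"] * 8 + ["INVALID_BET"] * 2})

counts = df["status"].value_counts()                 # absolute count per class
shares = df["status"].value_counts(normalize=True)   # fraction per class

# Ratio of the largest class to the smallest; 1.0 means perfectly balanced.
imbalance_ratio = counts.max() / counts.min()

print(shares)
print("imbalance ratio:", imbalance_ratio)
```

A ratio far from 1 suggests that plain accuracy can be misleading and that a stratified split is worth using.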

Observation :

2.4 Check for missing values
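
A small sketch of a per-column missing-value report, using made-up rows. The column names come from this notebook; the values do not.

```python
import numpy as np
import pandas as pd

# Made-up rows; only the column names are taken from the notebook.
df = pd.DataFrame({
    "IP": ["203.0.113.7", None, "198.51.100.23", None],
    "stake": [100.0, 50.0, np.nan, 20.0],
    "type": ["BACK", "LAY", "BACK", "BACK"],
})

missing = df.isnull().sum()                        # count of nulls per column
missing_pct = (df.isnull().mean() * 100).round(1)  # share of nulls per column

report = pd.DataFrame({"missing": missing, "percent": missing_pct})
# Keep only columns that actually have gaps, worst first.
report = report[report["missing"] > 0].sort_values("missing", ascending=False)
print(report)
```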

Observation :

2.5 Feature Analysis

_id

Observation :

2.5.1 Categorical Features Analysis

type

Observation :

horse

Observation :

marketId

Observation :

Observation :

IP

IP has 3 missing values.

Observation :

Feature Engineering on IP

Add Geographical Information from IP
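
The IP lookup itself depends on an external service and is not reproduced here. The sketch below only illustrates the follow-up step of splitting a `loc` field into the `Lattitude` and `Longitude` columns listed in section 3.1. It assumes `loc` holds a single "latitude,longitude" string; the IP addresses and coordinates are made up.

```python
import pandas as pd

# Made-up rows. Assumption: `loc` is one "latitude,longitude" string.
geo = pd.DataFrame({
    "IP": ["203.0.113.7", "198.51.100.23", None],
    "loc": ["19.0728,72.8826", "28.6519,77.2315", None],
})

# Split on the comma into two columns; rows without `loc` stay missing.
parts = geo["loc"].str.split(",", expand=True)

# to_numeric with errors="coerce" turns missing or malformed parts into NaN.
geo["Lattitude"] = pd.to_numeric(parts[0], errors="coerce")
geo["Longitude"] = pd.to_numeric(parts[1], errors="coerce")
print(geo)
```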

Image generated using Tableau

Observation :

eventType

Observation :

userName

Observation :

selectionName

Observation :

marketName

Observation :

event

winnerId

2.5.2 Numerical Features Analysis

stake

Observation :

Let's do more analysis.

Observation :

Log Transform
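
A minimal sketch of a log transform on `stake`, with made-up values. It uses `np.log1p` (that is, log(1 + x)) on the assumption that stakes are non-negative and may include zero; whether the notebook used `log` or `log1p` is not stated.

```python
import numpy as np
import pandas as pd

# Made-up, right-skewed stakes.
df = pd.DataFrame({"stake": [10.0, 50.0, 100.0, 500.0, 10000.0]})

# log1p stays finite at zero, unlike a plain log.
df["stake_log"] = np.log1p(df["stake"])

# Skewness closer to 0 indicates a more symmetric distribution.
skew_before = df["stake"].skew()
skew_after = df["stake_log"].skew()
print(f"skew before: {skew_before:.2f}, after: {skew_after:.2f}")
```

The transform is inverted with `np.expm1` when original units are needed.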

Observation :

Box-Cox Transform
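
A sketch of the Box-Cox transform with SciPy on synthetic data. Box-Cox requires strictly positive input, so the sample below is drawn from a log-normal distribution as a stand-in for `stake`.

```python
import numpy as np
from scipy import stats
from scipy.special import inv_boxcox

# Synthetic, strictly positive stand-in for the stake column.
rng = np.random.default_rng(0)
stake = rng.lognormal(mean=4.0, sigma=1.0, size=200)

# With lmbda left unset, boxcox returns the transformed data and the
# lambda that maximizes the log-likelihood.
stake_boxcox, stake_lambda = stats.boxcox(stake)

# The transform is invertible as long as lambda is kept.
restored = inv_boxcox(stake_boxcox, stake_lambda)
print("fitted lambda:", stake_lambda)
```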

Observation :

betRate

Observation :

Log Transform

Observation :

Box-Cox Transform

Observation :

averagePriceMatched

Observation :

Log Transform

Observation :

Box-Cox Transform

Observation :

Save Lambdas of the Box-Cox Transforms
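
The fitted lambdas need to be stored so the same transform can be applied to unseen data. The sketch below shows one possible approach using a JSON file; the file name, the synthetic data, and the choice of JSON are assumptions.

```python
import json
import os
import tempfile

import numpy as np
from scipy import stats

# Synthetic, strictly positive stand-ins for two of the numerical features.
rng = np.random.default_rng(0)
train = {
    "stake": rng.lognormal(4.0, 1.0, 200),
    "betRate": rng.lognormal(1.0, 0.5, 200),
}

# Fit one lambda per feature on the training data only.
lambdas = {}
for name, values in train.items():
    _, lam = stats.boxcox(values)
    lambdas[name] = float(lam)

# Hypothetical file name; any persistent store would do.
path = os.path.join(tempfile.mkdtemp(), "boxcox_lambdas.json")
with open(path, "w") as fh:
    json.dump(lambdas, fh)

with open(path) as fh:
    loaded = json.load(fh)

# Passing lmbda applies the stored transform instead of refitting.
new_stake = np.array([25.0, 400.0])
new_stake_boxcox = stats.boxcox(new_stake, lmbda=loaded["stake"])
```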

placedDate

Convert placedDate dtype to DateTime Format
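
A sketch of the conversion and of the derived date parts analysed in the following subsections. The timestamps are hypothetical; the derived column names follow the feature table in section 3.1.

```python
import pandas as pd

# Hypothetical timestamps, for illustration only.
df = pd.DataFrame({"placedDate": ["2021-01-19 14:30:00", "2021-01-23 09:05:12"]})

# Parse strings into datetime64 so the .dt accessor becomes available.
df["placedDate"] = pd.to_datetime(df["placedDate"])

df["Date"] = df["placedDate"].dt.date
df["time"] = df["placedDate"].dt.time
df["hour"] = df["placedDate"].dt.hour
# ISO week number (weeks start on Monday).
df["week_of_the_year"] = df["placedDate"].dt.isocalendar().week.astype(int)
# Monday = 0 ... Sunday = 6.
df["weekday"] = df["placedDate"].dt.dayofweek
print(df)
```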

Date

Observation :

Time

hour

Observation :

week of the year

Observation :

weekday (7 week days)

Observation :

TimeSeries Plot Generated Using Tableau

Observation :

Save Dataframe

2.6 Correlation Analysis

Compute pairwise correlation of columns, excluding NA/null values.
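
A small sketch of this behaviour with `DataFrame.corr`, on made-up values. For each pair of columns, rows where either value is missing are skipped, so a gap in one column does not affect the other pairs.

```python
import numpy as np
import pandas as pd

# Made-up values; column names are borrowed from the notebook.
df = pd.DataFrame({
    "stake_boxcox": [1.0, 2.0, 3.0, 4.0],
    "betRate_boxcox": [2.0, 4.0, 6.0, np.nan],  # last row is ignored for this pair
    "hour": [4.0, 3.0, 2.0, 1.0],
})

corr = df.corr(method="pearson")

# Flag strongly correlated pairs as candidates for dropping one of the two.
# The 0.9 threshold is an arbitrary choice for this example.
strong = (corr.abs() > 0.9) & (corr.abs() < 1.0 + 1e-9)
print(corr.round(3))
```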

Observation :

3. Data Standardization

3.1 Dropping useless features based on EDA and Correlation:

| Feature Name | Included | Reason |
|---|---|---|
| _id | No | Unique identifier for bets. |
| stake | No | stake_boxcox is used. |
| stake_log | No | stake_boxcox is used. |
| stake_boxcox | Yes | |
| type | Yes | |
| horse | Yes | |
| betRate | No | betRate_boxcox is used. |
| betRate_log | No | betRate_boxcox is used. |
| betRate_boxcox | Yes | |
| marketId | No | 390 categories. |
| eventType | Yes | |
| userName | No | 410 categories. |
| selectionName | No | 318 categories. |
| marketName | Yes | |
| event | Yes | |
| averagePriceMatched | No | averagePriceMatched_boxcox is used. |
| averagePriceMatched_log | No | averagePriceMatched_boxcox is used. |
| averagePriceMatched_boxcox | No | |
| winnerID | No | 195 categories. |
| IP | No | Unique identifier for IP. |
| Details (IP) | No | List of details about the IP. |
| city | No | Dropped here, but can be useful for larger data where bets have geographical correlation. |
| region | No | Dropped here, but can be useful for larger data where bets have geographical correlation. |
| country | No | Dropped here, but can be useful for larger data where bets have geographical correlation. |
| loc | No | Dropped here, but can be useful for larger data where bets have geographical correlation. |
| org | No | Dropped here, but can be useful for larger data where bets have geographical correlation. |
| postal | No | Dropped here, but can be useful for larger data where bets have geographical correlation. |
| timezone | No | Dropped here, but can be useful for larger data where bets have geographical correlation. |
| Lattitude | No | Dropped here, but can be useful for larger data where bets have geographical correlation. |
| Longitude | No | Dropped here, but can be useful for larger data where bets have geographical correlation. |
| placedDate | No | Unique identifier for the date and time when the bet is placed. |
| Date | No | EDA shows that all bets before 19 Jan are INVALID_BET, so the model would overfit by predicting from the date. |
| time | No | Represented by hour. |
| hour | Yes | |
| week_of_the_year | No | Can cause overfitting due to the temporal nature of the data. |
| weekday | Yes | |

3.2 Missing Value Imputation

The selected features have no missing values, but we keep the imputation code because other features, such as postal, do have missing values and may be used later.
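
A minimal imputation sketch, assuming mode imputation for categorical columns and median imputation for numerical ones; the strategy actually used is not stated here, and the rows are made up. The fill values are computed on the training rows only and then reused for the test rows.

```python
import numpy as np
import pandas as pd

# Made-up rows; `postal` is the feature mentioned above as having gaps.
train = pd.DataFrame({
    "postal": ["400001", None, "400001", "110001"],
    "stake_boxcox": [1.2, np.nan, 0.8, 1.0],
})
test = pd.DataFrame({
    "postal": [None, "560001"],
    "stake_boxcox": [np.nan, 1.5],
})

# Assumed strategy: most frequent value for categories, median for numbers.
fill_values = {
    "postal": train["postal"].mode().iloc[0],
    "stake_boxcox": train["stake_boxcox"].median(),
}

# The same training-derived values fill both splits.
train_filled = train.fillna(fill_values)
test_filled = test.fillna(fill_values)
```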

3.3 Train-Test Split
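
A sketch of a stratified split with scikit-learn on synthetic data. The 80/20 ratio, the use of `stratify`, and the random seed are assumptions, not values taken from this notebook.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic features and an imbalanced binary target (15 vs 5).
X = np.arange(40).reshape(20, 2)
y = np.array([0] * 15 + [1] * 5)

# stratify=y keeps the class proportions the same in both splits.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

print("train:", X_train.shape, "positives:", int(y_train.sum()))
print("test: ", X_test.shape, "positives:", int(y_test.sum()))
```

Splitting before encoding lets scalers and encoders be fitted on the training rows only.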

3.4 Feature Encoding

For Numerical features

For Categorical Features
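
A sketch of encoding for both feature kinds, assuming standardization for numerical features and one-hot encoding for categorical ones; the encoders actually used are not stated here, and the category values are made up. Both are fitted on the training split only.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Made-up rows; column names are taken from the feature table.
train = pd.DataFrame({
    "stake_boxcox": [1.0, 2.0, 3.0, 4.0],
    "eventType": ["Cricket", "Soccer", "Cricket", "Tennis"],
})
test = pd.DataFrame({
    "stake_boxcox": [2.5, 5.0],
    "eventType": ["Soccer", "Golf"],  # "Golf" never appears in train
})

# Numerical: zero mean and unit variance, using training statistics.
scaler = StandardScaler().fit(train[["stake_boxcox"]])
train_num = scaler.transform(train[["stake_boxcox"]])
test_num = scaler.transform(test[["stake_boxcox"]])

# Categorical: one column per category seen in training.
# handle_unknown="ignore" encodes unseen categories as an all-zero row.
encoder = OneHotEncoder(handle_unknown="ignore").fit(train[["eventType"]])
train_cat = encoder.transform(train[["eventType"]]).toarray()
test_cat = encoder.transform(test[["eventType"]]).toarray()
```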

4. Feature Selection

4.1 For Numerical Data

Observation :

4.2 For Categorical Data

5. Build Data Matrix for Models

6. TSNE Visualization
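
A sketch of a 2-D t-SNE embedding with scikit-learn. The data here is synthetic (two Gaussian clusters standing in for the two classes), and the perplexity and seed are arbitrary choices for the example.

```python
import numpy as np
from sklearn.manifold import TSNE

# Synthetic stand-in for the model matrix: two well-separated clusters.
rng = np.random.default_rng(0)
X = np.vstack([
    rng.normal(0.0, 1.0, size=(30, 5)),
    rng.normal(5.0, 1.0, size=(30, 5)),
])
y = np.array([0] * 30 + [1] * 30)

# Perplexity must be smaller than the number of samples.
tsne = TSNE(n_components=2, perplexity=10, random_state=0)
embedding = tsne.fit_transform(X)

print(embedding.shape)
```

Each row of `embedding` can then be drawn as a scatter point coloured by `y` to see whether the classes separate.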

Observation :

7. Models

7.1 Machine Learning Models

7.1.1 Random Model
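
A random model gives a floor that every real model should beat. The sketch below uses synthetic labels with an assumed 80/20 class mix and compares uniform random guessing with always predicting the majority class.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 10_000

# Synthetic labels with an assumed 80/20 class mix.
y_true = (rng.random(n) < 0.8).astype(int)

# Random model: guess each class with probability 0.5.
y_pred = rng.integers(0, 2, size=n)
accuracy = float((y_true == y_pred).mean())

# Majority-class baseline: always predict the most common label.
positive_share = float(y_true.mean())
majority_accuracy = max(positive_share, 1.0 - positive_share)

print(f"random accuracy:   {accuracy:.3f}")
print(f"majority accuracy: {majority_accuracy:.3f}")
```

On imbalanced data the majority baseline is usually the stricter of the two, so it is worth reporting alongside the random model.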

Observation :

7.1.2 Logistic Regression (SGD) with Hyperparameter Tuning

Observation :

7.1.3 Logistic Regression (Sklearn)

Observation :

7.1.4 KNN

Observation :

7.1.5 Decision Tree Model

Observation :

7.1.6 Random Forest Model

Observation :

7.1.7 XGBoost Model

Observation :

7.2 Deep Learning Model

7.2.1 Neural Network (ANN) Model

Observation :

8. Comparison

9. Conclusion